Random forest Gini importance favors SNPs with large minor allele frequency

Authors

  • Anne-Laure Boulesteix
  • Andreas Bender
  • Justo Lorenzo Bermejo
  • Carolin Strobl
Abstract

The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their association with the investigated phenotype. Here we investigate a characteristic of this methodology that may be considered an important pitfall, namely that common variants are systematically favored by the widely used Gini VIM. As a consequence, researchers may overlook rare variants that contribute to the missing heritability. The goal of the present paper is threefold: 1) to assess this effect quantitatively using simulation studies for different types of random forests (classical random forests and conditional inference forests, which employ unbiased variable selection criteria) as well as for different importance measures (Gini and permutation-based), 2) to explore the trees and to compare the behaviour of random forests and the standard logistic regression model in order to understand the statistical mechanisms behind the preference for common variants, and 3) to summarize our results and previously investigated properties of random forest VIMs in the context of association studies and to make practical recommendations regarding the methodological choice. The code implementing our study is available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/ginibias/
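The effect described in the abstract can be illustrated with a toy simulation. The following is a minimal sketch, not the authors' simulation design: it looks at a single small tree node rather than a full forest, and it assumes SNPs coded 0/1/2 under Hardy-Weinberg proportions, a binary phenotype that is independent of the genotype (the null case), and hypothetical helper names (`gini`, `best_gini_decrease`, `mean_null_decrease`). It estimates the average Gini decrease that a non-informative SNP achieves by chance, for several minor allele frequencies (MAF).

```python
import random


def gini(cases, n):
    """Gini impurity 2p(1-p) of a node with `cases` cases among `n` samples."""
    if n == 0:
        return 0.0
    p = cases / n
    return 2.0 * p * (1.0 - p)


def best_gini_decrease(genotypes, phenotype):
    """Largest weighted Gini decrease over the two ordered cutpoints of a 0/1/2 SNP."""
    n = len(genotypes)
    parent = gini(sum(phenotype), n)
    best = 0.0
    for cut in (0, 1):  # left child holds samples with genotype <= cut
        left = [y for g, y in zip(genotypes, phenotype) if g <= cut]
        right = [y for g, y in zip(genotypes, phenotype) if g > cut]
        if not left or not right:
            continue  # this cutpoint does not separate any samples
        child = (len(left) * gini(sum(left), len(left))
                 + len(right) * gini(sum(right), len(right))) / n
        best = max(best, parent - child)
    return best


def mean_null_decrease(maf, node_size=20, reps=4000, seed=1):
    """Average best Gini decrease of a non-informative SNP in a node of `node_size` samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        # Assumption: genotype = number of minor alleles among two independent draws.
        genotypes = [(rng.random() < maf) + (rng.random() < maf)
                     for _ in range(node_size)]
        # Assumption: phenotype is drawn independently of the genotype (null case).
        phenotype = [int(rng.random() < 0.5) for _ in range(node_size)]
        total += best_gini_decrease(genotypes, phenotype)
    return total / reps


results = {maf: mean_null_decrease(maf) for maf in (0.05, 0.1, 0.25, 0.5)}
for maf, value in results.items():
    print(f"MAF={maf:.2f}  mean null Gini decrease={value:.4f}")
```

In this sketch, a rare SNP is often monomorphic within a small node, or offers only one usable cutpoint, so it collects less impurity decrease by chance than a common SNP does. This is only one plausible contributor to the preference for common variants; the node size, the number of replicates and the seed are arbitrary choices made for illustration.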


Related resources

Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations

The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their ability to predict the investigated phenotype. Here, we investigate a characteristic of this methodology that may be considered as an important pitfall, n...


Danger: High Power! – Exploring the Statistical Properties of a Test for Random Forest Variable Importance

Random forests have become a widely-used predictive model in many scientific disciplines within the past few years. Additionally, they are increasingly popular for assessing variable importance, e.g., in genetics and bioinformatics. We highlight both advantages and limitations of different variable importance scores and associated testing procedures. For the test of Breiman and Cutler (2008), w...


Danger: High Power! – Exploring the Statistical Properties of a Test for Random Forest Variable

Random forests have become a widely-used predictive model in many scientific disciplines within the past few years. Additionally, they are increasingly popular for assessing variable importance, e.g., in genetics and bioinformatics. We highlight both advantages and limitations of different variable importance scores and associated testing procedures, especially in the context of correlated pred...


Party on! A New

Random forests are one of the most popular statistical learning algorithms, and a variety of methods for fitting random forests and related recursive partitioning approaches is available in R. This paper points out two important features of the random forest implementation cforest available in the party package: The resulting forests are unbiased and thus preferable to the randomForest implemen...




Publication date: 2011